Overview

In this report I detail the machine learning (ML) models I implemented to predict housing prices in the suburbs of Boston. The data set for this experiment comes from the UCI Machine Learning Repository, at https://archive.ics.uci.edu/ml/datasets/Housing. The report is organized to demonstrate the entire process: getting and cleaning the data, exploring the data set to understand the distribution of the features and their influence on the models, forming a hypothesis, training the ML models, and evaluating them.

Introduction

The data set consists of 506 observations of 14 attributes. The median value of owner-occupied homes in $1000s, denoted MEDV, is the outcome (dependent) variable in our models. Below is a brief description of each predictor feature and the outcome in our data set:

  1. CRIM – per capita crime rate by town
  2. ZN – proportion of residential land zoned for lots over 25,000 sq. ft.
  3. INDUS – proportion of non-retail business acres per town
  4. CHAS – Charles River dummy variable (1 if tract bounds river; else 0)
  5. NOX – nitric oxides concentration (parts per 10 million)
  6. RM – average number of rooms per dwelling
  7. AGE – proportion of owner-occupied units built prior to 1940
  8. DIS – weighted distances to five Boston employment centers
  9. RAD – index of accessibility to radial highways
  10. TAX – full-value property-tax rate per $10,000
  11. PTRATIO – pupil-teacher ratio by town
  12. B – 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
  13. LSTAT – % lower status of the population
  14. MEDV – median value of owner-occupied homes in $1000’s (response variable)

Getting and Cleaning the Data

The first step is getting the data into R, cleaning it, and transforming it into a neat, usable R data frame. The boston.txt file contains the actual data; each observation spans two consecutive lines, and the first two lines of the file (dropped below) hold the column headers. The R function readLines() reads the file into a character vector, one element per line. Below, I use subsetting and the strsplit() function to extract the column values from this file.

# READ THE DOWNLOADED DATA
text <- readLines("boston.txt")
text <- text[c(-1, -2)]  # drop the two header lines

PREPROCESSING / TRANSFORMING THE DATA

# Each observation spans two consecutive lines (1012 lines in all),
# so combine each pair of lines into a single 14-value row.
rows <- strsplit(text, " +")  # split each line on runs of spaces
df2 <- NULL
i <- 1
while (i <= 1012) {
    texti <- na.omit(as.numeric(rows[[i]]))      # first line of the pair
    textj <- na.omit(as.numeric(rows[[i + 1]]))  # second line of the pair
    df2 <- rbind(df2, c(texti, textj))
    i <- i + 2
}
colnames(df2) <- c("CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE", "DIS", 
    "RAD", "TAX", "PTRATIO", "B", "LSTAT", "MEDV")
rownames(df2) <- NULL
df <- as.data.frame(df2)
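As an optional sanity check (an addition here, not part of the original pipeline), the same data ship with the MASS package as the Boston data frame, with lower-case column names ("black" in place of "B"). Comparing the parsed values against it should return TRUE if the extraction above worked:

suppressMessages(library(MASS))  # provides the reference Boston data frame
all.equal(unname(as.matrix(df)), unname(as.matrix(Boston)))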

# first 10 observations
suppressMessages(library(formattable))  # for formattable()
head(formattable(df), 10)
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT MEDV
0.00632 18.0 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90 4.98 24.0
0.02731 0.0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90 9.14 21.6
0.02729 0.0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83 4.03 34.7
0.03237 0.0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63 2.94 33.4
0.06905 0.0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90 5.33 36.2
0.02985 0.0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 394.12 5.21 28.7
0.08829 12.5 7.87 0 0.524 6.012 66.6 5.5605 5 311 15.2 395.60 12.43 22.9
0.14455 12.5 7.87 0 0.524 6.172 96.1 5.9505 5 311 15.2 396.90 19.15 27.1
0.21124 12.5 7.87 0 0.524 5.631 100.0 6.0821 5 311 15.2 386.63 29.93 16.5
0.17004 12.5 7.87 0 0.524 6.004 85.9 6.5921 5 311 15.2 386.71 17.10 18.9

EXPLORATORY DATA ANALYSIS

Now, let us explore the cleaned data frame containing the housing data. The df object is of class ‘data.frame’, which is easy to work with in R scripts. The str() function displays the structure of an R data frame: its output below compactly provides the relevant information about our data frame, such as the number of observations, the number of variables, the name and class of each column, and sample values from each column.

DISPLAY SUMMARY STATISTICS OF DATA

## DISPLAY SUMMARY STATISTICS
summary(df)
##       CRIM                ZN             INDUS            CHAS        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       NOX               RM             AGE              DIS        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       RAD              TAX           PTRATIO            B         
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      LSTAT            MEDV      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

DISPLAY SUMMARY STRUCTURE OF HOUSING DATA (df)

str(df)
## 'data.frame':    506 obs. of  14 variables:
##  $ CRIM   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ ZN     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ INDUS  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ CHAS   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ NOX    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ RM     : num  6.58 6.42 7.18 7 7.15 ...
##  $ AGE    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ DIS    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ RAD    : num  1 2 2 3 3 3 5 5 5 5 ...
##  $ TAX    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ PTRATIO: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ B      : num  397 397 393 395 397 ...
##  $ LSTAT  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ MEDV   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

Let us visualize the distribution and density of the outcome, MEDV. The black curve represents the density, and a boxplot is also plotted for an additional perspective. We see that the median housing value is skewed to the right, with a number of outliers on the high end. It may therefore be useful to transform the MEDV column, for example with the natural logarithm, when modeling the hypothesis for regression analysis.
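A quick numeric check supports this impression (a sketch added here; it assumes the e1071 package, which the original analysis did not load):

suppressMessages(library(e1071))
skewness(df$MEDV)       # positive, confirming the right skew
skewness(log(df$MEDV))  # closer to zero after a log transform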

suppressMessages(library(ggplot2))
suppressMessages(library(plotly))

p01 <- ggplot(df, aes(x = MEDV)) + xlab("Median value of owner-occupied homes") + 
    geom_histogram(aes(y = ..density..), binwidth = 1, fill = "green") + geom_density(alpha = 0.8, 
    fill = "red") + ylim(0, 0.075)
ggplotly(p01)
p02 <- ggplot(df, aes(y = MEDV, x = RM)) + xlab("average number of rooms per dwelling") + 
    ylab("Median value of owner-occupied homes") + geom_point(colour = "red") + 
    geom_smooth(method = lm, colour = "blue") + geom_smooth(colour = "orange")
ggplotly(p02)
p03 <- ggplot(df, aes(y = MEDV, x = LSTAT)) + xlab("% lower status of the population") + 
    ylab("Median value of owner-occupied homes") + geom_point(colour = "green") + 
    geom_smooth(method = lm, colour = "yellow") + geom_smooth(colour = "red")
ggplotly(p03)
p04 <- ggplot(df, aes(y = MEDV, x = PTRATIO)) + xlab("pupil-teacher ratio by town") + 
    ylab("Median value of owner-occupied homes") + geom_point(colour = "black") + 
    geom_smooth(method = lm, colour = "yellow") + geom_smooth(colour = "green")
ggplotly(p04)
# a constant x aesthetic gives a single boxplot and avoids the
# "Continuous x aesthetic" warning
b0x <- ggplot(df, aes(x = "", y = MEDV)) + geom_boxplot(alpha = 0.8, colour = "black", 
    fill = "green") + coord_flip()
b0x

Now, let us draw scatter plots of some of the intuitively important variables against the outcome variable MEDV. We see strong positive or negative correlations between these variables and the outcome. It is also evident that INDUS and NOX are strongly positively correlated with one another, as nitric oxide levels tend to rise as industry increases.

plot(df[, c(3, 5, 6, 11, 13, 14)], pch = 3)  # INDUS, NOX, RM, PTRATIO, LSTAT, MEDV
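For a view of all pairwise correlations at once, a correlation heatmap is a useful addition (a sketch; it assumes the corrplot package, which the original analysis did not use):

library(corrplot)
corrplot(cor(df), method = "circle", type = "upper")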

Correlation and near-zero variance

A few important properties to check now are the correlation of each input feature with the dependent variable, and whether any feature has near-zero variance (values that barely vary within the column).

Correlation of each independent variable with the dependent variable

suppressMessages(library(caret))
cor(df, df$MEDV)
##               [,1]
## CRIM    -0.3883046
## ZN       0.3604453
## INDUS   -0.4837252
## CHAS     0.1752602
## NOX     -0.4273208
## RM       0.6953599
## AGE     -0.3769546
## DIS      0.2499287
## RAD     -0.3816262
## TAX     -0.4685359
## PTRATIO -0.5077867
## B        0.3334608
## LSTAT   -0.7376627
## MEDV     1.0000000

We see that the number of rooms, RM, has the strongest positive correlation with the median housing value, while the percentage of lower-status population, LSTAT, and the pupil-teacher ratio, PTRATIO, have strong negative correlations. The feature least correlated with MEDV is proximity to the Charles River, CHAS.
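Collinearity among the predictors themselves is also worth flagging early. caret's findCorrelation() lists features whose pairwise correlation exceeds a cutoff (a sketch added here; the 0.75 cutoff is an arbitrary choice, and caret is already loaded above):

findCorrelation(cor(df[1:13]), cutoff = 0.75, names = TRUE)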

Calculate near-zero variance

nzv <- nearZeroVar(df, saveMetrics = TRUE)
sum(nzv$nzv)
## [1] 0

The output shows that no variable has zero or near-zero variance.

Feature Engineering and Data Partitioning

Next, we center and scale the input features. Then we partition the data into training and test sets in a 70/30 ratio.

# centering and scaling of input features
df <- cbind(scale(df[1:13]), df[14])

Setting the seed to maintain reproducibility

set.seed(12345)

CREATE TEST & TRAIN DATA PARTITIONS

inTrain <- createDataPartition(y = df$MEDV, p = 0.70, list = FALSE)
training <- df[inTrain, ]
testing <- df[-inTrain, ]
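One caveat: scaling the full data frame before splitting lets test-set statistics leak into the training data. A stricter alternative is to learn the centering/scaling parameters on the training split only, for example with caret's preProcess() (a sketch, not what was run above):

# prep <- preProcess(training[1:13], method = c("center", "scale"))
# training[1:13] <- predict(prep, training[1:13])
# testing[1:13] <- predict(prep, testing[1:13])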

DEVELOPING A MODEL FOR PREDICTION

In this section I develop the tools and techniques needed for a model to make predictions. Being able to evaluate each model’s performance accurately greatly reinforces confidence in the predictions.

Defining a Performance Metric

R2 (R-squared), also known as the coefficient of determination, typically ranges from 0 to 1 and captures the proportion of the variance in the target variable that the model explains. A model with an R2 of 0 is no better than one that always predicts the mean of the target variable, whereas a model with an R2 of 1 predicts the target perfectly. Any value between 0 and 1 indicates what fraction of the target’s variance the features explain under this model. R2 can also be negative, which indicates that the model is arbitrarily worse than one that always predicts the mean of the target variable.

RMSE (root mean square error) is also a good measure of how accurately the model predicts the response, and it is the most important criterion for fit when the main purpose of the model is prediction. Unlike R2, RMSE is expressed in the units of the target variable and ranges from 0 upward: the closer the RMSE is to 0, the better the fit and the better the prediction accuracy.
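To make these definitions concrete, here is a minimal sketch of the two metrics as R helper functions (hypothetical helpers added for illustration; the analysis below computes them inline instead):

# RMSE: square root of the mean squared prediction error
rmse <- function(pred, actual) sqrt(mean((pred - actual)^2))
# R2: 1 minus the ratio of residual to total sum of squares
r_squared <- function(pred, actual) 1 - sum((actual - pred)^2)/sum((actual - mean(actual))^2)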

LINEAR REGRESSIONS

First, let us fit a multiple linear regression model with MEDV as the dependent variable and all the remaining variables as independent variables, training the model on the training data set. Below are the coefficients of all the features, along with the intercept. Next, we use the trained model to predict the outcome (MEDV) for the testing data set. A good metric for the accuracy of the model is the root-mean-square error, given by:

\[\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} \left(y_{\mathrm{pred},i} - y_{\mathrm{act},i}\right)^2}\]

## LINEAR REGRESSION MODEL 1 – using all features

set.seed(12345)
fit.lm <- lm(MEDV ~ ., data = training)

fit.lm
## 
## Call:
## lm(formula = MEDV ~ ., data = training)
## 
## Coefficients:
## (Intercept)         CRIM           ZN        INDUS         CHAS  
##     22.7369      -0.9232       0.9084       0.1753       0.8109  
##         NOX           RM          AGE          DIS          RAD  
##     -2.6324       2.5781       0.3322      -3.2202       2.2919  
##         TAX      PTRATIO            B        LSTAT  
##     -1.8564      -2.2865       0.6725      -3.6271
suppressMessages(library(ggfortify))  # provides autoplot() diagnostics for lm objects
autoplot(fit.lm, which = 1:6, ncol = 3, label.size = 3)

#### CHECK COEFFICIENTS

data.frame(coef = round(fit.lm$coefficients, 2))
##              coef
## (Intercept) 22.74
## CRIM        -0.92
## ZN           0.91
## INDUS        0.18
## CHAS         0.81
## NOX         -2.63
## RM           2.58
## AGE          0.33
## DIS         -3.22
## RAD          2.29
## TAX         -1.86
## PTRATIO     -2.29
## B            0.67
## LSTAT       -3.63
summary(fit.lm)
## 
## Call:
## lm(formula = MEDV ~ ., data = training)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -14.911  -2.737  -0.724   1.839  26.275 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  22.7369     0.2529  89.899  < 2e-16 ***
## CRIM         -0.9232     0.3071  -3.006 0.002839 ** 
## ZN            0.9084     0.3875   2.344 0.019647 *  
## INDUS         0.1753     0.5193   0.338 0.735852    
## CHAS          0.8109     0.2586   3.135 0.001866 ** 
## NOX          -2.6324     0.5379  -4.894 1.53e-06 ***
## RM            2.5781     0.3520   7.324 1.74e-12 ***
## AGE           0.3322     0.4511   0.736 0.462006    
## DIS          -3.2202     0.5051  -6.376 5.90e-10 ***
## RAD           2.2919     0.6855   3.343 0.000919 ***
## TAX          -1.8564     0.7681  -2.417 0.016177 *  
## PTRATIO      -2.2865     0.3312  -6.903 2.48e-11 ***
## B             0.6725     0.3033   2.217 0.027255 *  
## LSTAT        -3.6271     0.4349  -8.339 1.85e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.753 on 342 degrees of freedom
## Multiple R-squared:  0.7491, Adjusted R-squared:  0.7396 
## F-statistic: 78.54 on 13 and 342 DF,  p-value: < 2.2e-16
ggplot(df, aes(x = NOX, y = MEDV)) + geom_smooth()

set.seed(12345)

# predict on test set
pred.lm <- predict(fit.lm, newdata = testing)

# Root-mean squared error
rmse.lm <- sqrt(sum((pred.lm - testing$MEDV)^2)/length(testing$MEDV))

data.frame(RMSE = rmse.lm, R2 = summary(fit.lm)$r.squared, P_value = summary(fit.lm)$coefficients[1, 
    4])
##       RMSE        R2       P_value
## 1 4.826464 0.7490908 5.003731e-240
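As an aside, caret (already loaded) bundles this computation: postResample() returns RMSE, R-squared, and MAE between predictions and observed values, with R-squared computed on the test set rather than on the training fit:

postResample(pred = pred.lm, obs = testing$MEDV)  # (output not shown)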

We see that the RMSE is 4.8265 and the R2 value is 0.7491 for this model.

LINEAR REGRESSION MODEL 2

We also saw that the outcome variable MEDV is skewed to the right; a log transformation should make its distribution closer to normal. Let us fit a glm with log(MEDV) as the outcome and the significant features as inputs. We will see that the RMSE is lower for this model.

Try linear model using significant features

set.seed(12345)

fit.lm1 <- glm(log(MEDV) ~ CRIM + CHAS + NOX + RM + DIS + PTRATIO + RAD + B + 
    LSTAT, data = training)


# predict on test set
summary(fit.lm1)
## 
## Call:
## glm(formula = log(MEDV) ~ CRIM + CHAS + NOX + RM + DIS + PTRATIO + 
##     RAD + B + LSTAT, data = training)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -0.72099  -0.09896  -0.01041   0.09383   0.82814  
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.04204    0.01042 291.842  < 2e-16 ***
## CRIM        -0.09118    0.01260  -7.233 3.05e-12 ***
## CHAS         0.03391    0.01057   3.208  0.00146 ** 
## NOX         -0.11597    0.01988  -5.833 1.25e-08 ***
## RM           0.06886    0.01395   4.937 1.24e-06 ***
## DIS         -0.10563    0.01698  -6.220 1.43e-09 ***
## PTRATIO     -0.09544    0.01274  -7.492 5.71e-13 ***
## RAD          0.04143    0.01745   2.375  0.01811 *  
## B            0.03312    0.01247   2.655  0.00830 ** 
## LSTAT       -0.19588    0.01680 -11.657  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.03839931)
## 
##     Null deviance: 62.996  on 355  degrees of freedom
## Residual deviance: 13.286  on 346  degrees of freedom
## AIC: -138.32
## 
## Number of Fisher Scoring iterations: 2
# predict on test set
pred.lm1 <- predict(fit.lm1, newdata = testing)
pred.lm1
##        1        2        6       11       12       20       22       25 
## 3.424905 3.215083 3.243884 2.901499 3.071125 2.936919 2.878708 2.789241 
##       26       28       30       36       39       42       47       62 
## 2.714936 2.737635 2.981513 3.155189 3.119216 3.345558 3.028396 2.884679 
##       70       72       73       80       81       83       85       86 
## 3.115354 3.128846 3.264412 3.191590 3.328428 3.247598 3.201864 3.325643 
##       87       92       94       95       98      100      102      105 
## 3.093660 3.311999 3.336809 3.223000 3.599877 3.482120 3.221758 3.054268 
##      112      114      118      120      121      126      127      129 
## 3.270493 3.025031 3.194205 3.073372 2.980889 2.990914 2.613032 2.904226 
##      137      138      144      148      152      154      156      157 
## 2.800400 2.925245 2.531525 2.407082 2.884173 2.807819 2.921925 2.683686 
##      161      165      169      170      179      180      186      188 
## 3.571201 3.209312 3.249681 3.254000 3.445617 3.475009 3.145311 3.536334 
##      198      199      201      204      205      206      219      220 
## 3.458007 3.542799 3.380556 3.691603 3.734202 3.120015 3.159950 3.379877 
##      221      223      227      229      230      236      239      240 
## 3.493257 3.456289 3.630410 3.563359 3.456111 3.198129 3.341913 3.327938 
##      242      246      247      254      255      259      265      269 
## 3.136338 2.729533 3.027688 3.352089 3.173883 3.531687 3.511827 3.698986 
##      270      271      281      284      285      291      294      299 
## 3.200219 3.049443 3.678104 3.861841 3.372352 3.415830 3.295066 3.383298 
##      302      303      311      312      313      314      316      317 
## 3.329713 3.338158 2.963462 3.289098 3.117618 3.223568 3.036883 2.864571 
##      324      328      335      336      343      344      345      349 
## 2.992317 2.967962 3.054559 3.019671 3.202453 3.265521 3.313382 3.229837 
##      352      354      359      366      371      372      378      379 
## 3.078071 3.117479 3.010418 2.844805 3.531225 3.102840 2.793666 2.537935 
##      380      383      384      389      393      397      398      408 
## 2.635077 2.572205 2.559339 2.251644 2.431292 2.810711 2.712854 2.896913 
##      409      411      423      425      429      430      431      432 
## 2.601788 2.487682 2.842687 2.705089 2.607256 2.519765 2.801332 2.773441 
##      433      434      439      442      448      450      457      458 
## 2.982320 2.743538 2.106845 2.695378 2.756805 2.716869 2.585770 2.588698 
##      466      467      468      472      473      481      485      486 
## 2.865484 2.691547 2.775159 3.080388 3.026308 3.105431 2.958702 3.056963 
##      491      497      500      501      502      506 
## 2.416903 2.732105 2.936100 2.997094 3.090174 3.084712
range(pred.lm1)
## [1] 2.106845 3.861841
# Root-mean squared error
rmse.lm1 <- sqrt(sum((exp(pred.lm1) - testing$MEDV)^2)/length(testing$MEDV))

# summary() of a glm has no $r.squared element, so compute a pseudo-R2
# from the null and residual deviances instead
r2.lm1 <- 1 - fit.lm1$deviance/fit.lm1$null.deviance

c(RMSE = rmse.lm1, R2 = r2.lm1)
##      RMSE        R2 
## 4.612483 0.789095

We see that the RMSE is 4.6125 and the pseudo-R2 value is 0.7891 for this model, an improvement over linear model 1.

Let us examine the calculated p-value for each feature in the full linear model. Any feature that is not significant (p > 0.05) is not contributing much to the model, probably due to multicollinearity among the features. From the model 1 summary we see that INDUS and AGE are not significant, and ZN only marginally so. Variance inflation factors, computed with vif() from the car package, quantify how much this collinearity inflates the standard errors of the coefficient estimates; ideally the VIF should be less than 5 for all features.

suppressMessages(library(car))
vif(fit.lm)

We see that the VIF is greater than 5 for RAD and TAX.

log(MEDV) ~ CRIM + CHAS + NOX + RM + DIS + PTRATIO + RAD + B + LSTAT

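The code chunk that fit and evaluated this third model did not survive. Below is a plausible sketch, under the assumption that it dropped the high-VIF feature RAD and was evaluated the same way as model 2; the names fit.lm2, pred.lm2, and rmse.lm2 are introduced here for illustration:

# hypothetical reconstruction - the original chunk was lost
fit.lm2 <- glm(log(MEDV) ~ CRIM + CHAS + NOX + RM + DIS + PTRATIO + B + LSTAT, 
    data = training)
pred.lm2 <- predict(fit.lm2, newdata = testing)
rmse.lm2 <- sqrt(mean((exp(pred.lm2) - testing$MEDV)^2))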

This model is marginally less accurate than linear model 2, with a slightly higher RMSE and a slightly lower R2. Let us plot the predicted vs actual values of the outcome MEDV.

LINEAR MODEL 2 PLOT – Predicted Prices vs Actual Prices

# note: pred.lm1 is on the log scale, so label the axis accordingly
plot(pred.lm1, testing$MEDV, xlab = "Predicted log(Price)", ylab = "Actual Price")

# diagnostics plots
layout(matrix(c(1, 2, 3, 4), 2, 2))
plot(fit.lm1)

LINEAR MODEL 2 – TABLE showing the first six observations – Actual vs Predicted Price

suppressMessages(library(data.table))  # for data.table()
table <- data.frame(x = pred.lm1, y = testing$MEDV)
names(table) <- c("Predicted_log_Price", "Actual_Price")
head(data.table(table), 6)
##    Predicted_log_Price Actual_Price
## 1:            3.424905         24.0
## 2:            3.215083         21.6
## 3:            3.243884         28.7
## 4:            2.901499         15.0
## 5:            3.071125         18.9
## 6:            2.936919         18.2

RANDOM FOREST MODEL

For the random forest implementation, we use the same model formula, MEDV ~ . (MEDV is the outcome, with all other features as inputs). Random forests capture nonlinearities and interactions automatically, so better model performance is expected; inspecting the results below, the random forest model indeed gives the best accuracy so far.

library(randomForest)
set.seed(12345)
fit.rf <- randomForest(formula = MEDV ~ ., data = training)
fit.rf
## 
## Call:
##  randomForest(formula = MEDV ~ ., data = training) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 4
## 
##           Mean of squared residuals: 11.93118
##                     % Var explained: 86.2
pred.rf <- predict(fit.rf, testing)

rmse.rf <- sqrt(sum(((pred.rf) - testing$MEDV)^2)/length(testing$MEDV))
table <- c(RMSE = rmse.rf, Rsquared = mean(fit.rf$rsq))
print(table)
##     RMSE Rsquared 
## 3.365975 0.855014
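Accuracy might be pushed a little further by tuning the number of variables tried at each split (mtry). A cross-validated grid search with caret would look roughly like this (a sketch, not run in this report; fit.rf.cv is a name introduced here for illustration):

# fit.rf.cv <- train(MEDV ~ ., data = training, method = "rf",
#     trControl = trainControl(method = "cv", number = 5),
#     tuneGrid = data.frame(mtry = c(2, 4, 6, 8)))
# fit.rf.cv$bestTune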

RANDOM FOREST PLOT – Predicted Prices vs Actual Prices

plot(pred.rf, testing$MEDV, xlab = "Predicted Price", ylab = "Actual Price", 
    pch = 3)

RANDOM FOREST MODEL TABLE

Table showing the first eight observations – Actual vs Predicted Price

suppressMessages(library(knitr))  # for kable()
table1 <- data.frame(x = pred.rf, y = testing$MEDV)
names(table1) <- c("Predicted_Price", "Actual_Price")
head(kable(table1), 10)
##  [1] "       Predicted_Price   Actual_Price"
##  [2] "----  ----------------  -------------"
##  [3] "1            29.033943           24.0"
##  [4] "2            23.418307           21.6"
##  [5] "6            27.781137           28.7"
##  [6] "11           21.165063           15.0"
##  [7] "12           21.218273           18.9"
##  [8] "20           19.807877           18.2"
##  [9] "22           18.389208           19.6"
## [10] "25           17.227528           15.6"

MODEL COMPARISON

Linear_Model_1 <- c(RMSE = rmse.lm, R2 = summary(fit.lm)$r.squared)
Linear_Model_2 <- c(RMSE = rmse.lm1, R2 = r2.lm1)
Random_Forest_Model <- c(RMSE = rmse.rf, R2 = mean(fit.rf$rsq))

model_comparison <- rbind(Linear_Model_1, Linear_Model_2, Random_Forest_Model)
kable(model_comparison)

                          RMSE          R2
--------------------  --------  ----------
Linear_Model_1        4.826464   0.7490908
Linear_Model_2        4.612483   0.7890950
Random_Forest_Model   3.365975   0.8550140

Conclusion

We experimented with two linear regression models and a random forest model to predict housing prices in the suburbs of Boston. Among these, the random forest model, which relates the outcome to all input features using the same formula, yielded the best predictions, as determined by the smallest RMSE (root mean squared error) and the highest R2 (R-squared statistic) on the held-out test set.